Optimize gemv_n_sve_v1x3 kernel #5292
base: develop
kernel/arm64/gemv_n_sve_v1x3.c
Outdated
pg00 = svand_z(SV_TRUE(), pg0, pg00);
pg01 = svand_z(SV_TRUE(), pg0, pg01);
pg02 = svand_z(SV_TRUE(), pg0, pg02);
svbool_t pg_tail = SV_WHILE(i, m);
Would it be better to pre-calculate this predicate outside of the loop?
It is re-used again below.
Since we are calculating the predicate for the tail elements, it depends on the value of i. If we move it outside of the loop, we would have to calculate it for (0, m % sve_size), but I think that could go wrong sometimes, since we want the range (i, m) and not one starting from 0. What are your thoughts on this?
I don't see why (0, m % sve_size) wouldn't work, since we increment i by sve_size in the main loop. Please also see https://github.com/OpenMathLib/OpenBLAS/pull/5089/files#diff-d0b63f332b08eef9b57a1eec785ff43afc468108c60f237b0c4e9401df08b510R68
Yes, it will work. It is just that it gives the predicate for the (0, m % sve_size) index range rather than for the actual (i, m) range. Sure, I will make this change. Thanks.
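For readers following the thread: a while-less-than predicate is lane-relative (lane k is active when base + k < limit), so the two forms discussed above select the same lanes once i has reached the tail position. Below is a minimal sketch in portable C that models this with a bitmask. It does not use real SVE intrinsics; `model_whilelt` and `tail_predicates_agree` are hypothetical names for illustration, and it assumes SV_WHILE maps to a while-less-than intrinsic and that the main loop advances i by the vector length.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a "while less-than" predicate (assumption: this is
   what SV_WHILE expands to). Lane k is active iff base + k < limit.
   The result is a bitmask where bit k represents lane k. */
static uint64_t model_whilelt(uint64_t base, uint64_t limit, unsigned lanes)
{
    assert(lanes >= 1 && lanes <= 64);
    uint64_t mask = 0;
    for (unsigned k = 0; k < lanes; k++) {
        if (base + k < limit) {
            mask |= (uint64_t)1 << k;
        }
    }
    return mask;
}

/* Returns 1 if the in-loop tail predicate (i, m) and the hoisted
   predicate (0, m % lanes) activate exactly the same lanes. */
static int tail_predicates_agree(uint64_t m, unsigned lanes)
{
    /* Value of i after a main loop that advances by `lanes` per step. */
    uint64_t i = m - (m % lanes);
    uint64_t in_loop = model_whilelt(i, m, lanes);
    uint64_t hoisted = model_whilelt(0, m % lanes, lanes);
    return in_loop == hoisted;
}
```

For example, with m = 10 and 4 lanes, both forms activate lanes 0 and 1, so hoisting the predicate out of the loop does not change which elements the tail processes.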
CodSpeed Performance Report
Merging #5292 will improve performance by 10.54%.
Benchmarks breakdown
- Calculate predicate outside the loop
- Divide matrix in blocks of 3
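As a rough illustration of what blocking by 3 means for a non-transposed GEMV (y += alpha * A * x), here is a scalar C sketch. It is not the kernel from this PR: the function name `gemv_n_blocked3`, its signature, and the column-major layout with leading dimension `lda` are assumptions made for the example, and the real kernel uses SVE vectors and predicates in place of the inner scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch: y += alpha * A * x, with A stored
   column-major as m rows by n columns and leading dimension lda. */
static void gemv_n_blocked3(size_t m, size_t n, double alpha,
                            const double *a, size_t lda,
                            const double *x, double *y)
{
    assert(lda >= m);
    size_t j = 0;

    /* Main part: three columns per pass, so each y[i] is loaded and
       stored once per three columns instead of once per column. */
    for (; j + 3 <= n; j += 3) {
        const double *a0 = a + j * lda;
        const double *a1 = a0 + lda;
        const double *a2 = a1 + lda;
        const double x0 = alpha * x[j];
        const double x1 = alpha * x[j + 1];
        const double x2 = alpha * x[j + 2];
        for (size_t i = 0; i < m; i++) {
            y[i] += a0[i] * x0 + a1[i] * x1 + a2[i] * x2;
        }
    }

    /* Remainder: the last n % 3 columns, one at a time. */
    for (; j < n; j++) {
        const double *aj = a + j * lda;
        const double xj = alpha * x[j];
        for (size_t i = 0; i < m; i++) {
            y[i] += aj[i] * xj;
        }
    }
}
```

The point of the grouping is to reduce traffic on y relative to a one-column-at-a-time loop. Whether that pays off on a given core is something the benchmarks have to show.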
1ed7eb6 to 8279e68
LGTM. @martin-frbg any further comments?
x-axis -> M = N
y-axis -> GFLOPS (timing)